Zookeper Notes

By Yahoo!

Intro:
- Aims to provide a simple and high performance kernel for building coordination primitives in the client
- Provides per-client guarantee of FIFO execution and linearizability of requests
- The entire idea is instead of implementing synchronisation primitives server-side, they expose an API and allow clients to define their own
- Exposed API manipulates simple wait-free data objects organised hierarchially
- ZooKeeper resembles any other filesystem (Chubby without lock, open and close methods)
- Exposed API can implement consensus for any number of processes
- Pipelined architecture --> FIFO per clients for free
- Guaranteeing FIFO enables clients to submit requests asynchronously:
  - If a new leader is elected, it has to update metadata. Metadata updates are queued asynchronously and initialisation is of sub-second order (compared to synchronous initialisation)
- Zab --> Leader-based atomic broadcast protocol (to order things)
- Read operations are not totally ordered
- Client-side caching is used for things like leader id (observer pattern in cache):
  - Better than Chubby, since Chubby pauses updates to invalidate all caches (that use changed data)
  - Chubby uses leases to manage slow/faulty clients, but ZooKeeper avoids the problem altogether by allowing clients to manage cache
- Only writes are linearizable
The ZooKeeper service:
- Overview:
  - Client -- user of the ZooKeeper service
  - Server -- process providing the ZooKeeper service
  - Znode -- in-memory data node in the ZooKeeper data
  - Referring to a znode is done with standard UNIX notation of A/B/C
- Znodes:
  - Regular -- created and deleted by client explicitly
  - Ephemeral -- Created explicitly, deleted explicitly or when the creating session terminates
  - Znodes have a sequential flag set upon creation. If set it appends the value of a monothonically increasing counter to the znodes name
  - Unlike files, znodes are not designed for general data storage
  - Have associated metadata, timestamps and version counters (allows tracking and conditional updates)
- Watches (Observer subscriptions):
  - Update clients in a timely manner and avoid polling
  - Read operations have a watch flag. When set the server promises to notify the client when the information it has just returned (as part of a read) has changed
  - Watches are one-time triggers associated with a session (unregistered once triggered or the session closes)
  - Watches indicate that a change has happened but do not provide the change
- Data Model:
  - Filesystem (key/value storage with hierarchial keys) with full data reads and writes
  - Hierarchial namespace is useful for distinguishing applications and setting access rights
- Sessions:
  - Client connection initiates the session
  - Sessions have a timeout
  - Client marked as faulty if there are no updates from it in the timeout window
  - Session is ended when the client closes it, or marked as faulty
  - Within a session the client observes a succession of state changes (execution of its operations)
  - Sessions enable clients to move transparently from one server to another (persist across servers)
- Client API:
  - Allows creation, deletion and exists check on znodes
  - Get and set data stored in the znode
  - Get children of a znode
  - Waiting for all updates pending at the start of the operation to propagate to the server that the client is connected to (sync call).
  - All methods have both - synchronous and asynchronous versions available
  - ZooKeeper does not use handles to access znodes (every request uses the full path)
  - Each update method takes an "expected version number" --> enable conditional updates
  - Expected version -1 --> no version check on update
- Guarantees:
  - FIFO and linearizability (a-linearizability -- client is multithreaded)
  - Read requests are processed locally at each replica
  - Only updates are a-linearizable
  - Example: Leader election - delete the ready znode, update config, create ready
  - Liveness - if a majority of servers are active and communicating the service is available
  - Durability - if the service responds successfully to an update - the update persists
- Examples of primitives: (implementing more powerful primitives)
  - All primitives are implemented in the client
  - Configuration Management:
    1. Configuration is stored in znode A
    2. Processes start up with a full path to A
    3. Starting processes read A and set the watch flag to be true
    4. If A ever changes, they receive the notification and update their config accordingly
    5. After they update, they set the watch flag again
  - Rendezvous:
    1. Client creates a rendezvous node A
    2. Client passes the full path to A as a parameter to its peer
    3. When the peer starts, it fills A with all addresses and ports it is using
    4. When workers start they watch-read A
  - Group Membership:
    1. Use the fact that ephemeral nodes allow us to see the state of the session that created them
    2. Create znode A to represent the group
    3. When a process in A starts, it creates an ephemeral child of A (A/Ai)
    4. To ensure that names are unique (if not already), processes can use the sequential flag

Notes on ZooKeeper: Wait-free coordination for Internet-scale systems